Executive summary

Research goal

Methodology

Main Takeaways

Conclusion

1. Introduction

There are many ways to learn and share knowledge nowadays, and one of the most popular is the short video clip. In this project we apply both machine-learning and text-mining techniques to real-world data to study the classification of videos from their transcripts or subtitles. At the start of the project we considered several streaming and podcast platforms, such as YouTube, BBC Learning English, Apple Podcasts, and TED Talks, all of them familiar, user-friendly platforms. TED was ultimately selected because of the data it provides: it hosts a wide range of videos in terms of topic, language, and length, and, more importantly, each video on the TED website is labelled with a relevant category and accompanied by a text transcript.

Therefore, the goals of this project are, first, to use sentiment analysis to identify the opinions, judgements, or feelings in what TED speakers say about each talk topic; second, to use topic analysis to cluster the videos and compare the clusters with the categories labelled on the TED website; and, last, to apply a text-classification technique to predict the topics of new videos.

The remainder of the report is organized as follows. Section 2 describes the data and the web scraping. Section 3 presents tokenization. Section 4 presents exploratory data analysis. Section 5 presents sentiment analysis. Section 6 performs topic modelling. Section 7 performs embedding analysis. Section 8 performs the supervised analysis. Section 9 summarizes the main results and briefly discusses limitations and possible directions for further study.

2. Data Preparation

We acquired the transcript text of each TED Talk by scraping the TED website, in the following order:

  • Use the RSelenium package to open the TED website.

  • Go to the TED Talks section by clicking the navigation-bar button.

  • Select the language, topics, and sort order to specify the range of videos.

  • Because the structure of the TED website is not stable and changes over time, we could not reliably click into each video's page to scrape data directly. Instead, we first scraped the video titles from the browser's result page after the third step. The output of this step is a data frame containing the titles of all the videos we want to scrape further.

  • Click in the search box, search for each video by its title, and always open the first result using its XPath.

  • On each video's page, first click the Read transcript button to expand the transcript area, then scrape all the related information needed for the following analyses.

  • After scraping each video's information, go back to the browser's result page.

During scraping we found that the for loop that clicks into each video and scrapes its text was often interrupted, and some XPaths failed when the scraper was run on different days. We therefore adopted the following countermeasures:

  • Because we obtain the list of video titles first, after an interruption we compare the videos already scraped successfully against the full title list to determine which videos remain to be scraped.
  • We take turns using four locator strategies (CSS, XPath, link text, and partial link text) to find each video and the Read transcript and Next buttons.
  • Because this is dynamic web crawling, we add Sys.sleep to each scraping and clicking step so the website has time to react.

Finally, we call a closing function at the end, since closing the browser incorrectly could affect future scraping runs.
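The scraping loop described above can be sketched as follows. This is a minimal, illustrative sketch: the CSS selectors, and the `all_titles`/`done_titles` vectors used to resume after an interruption, are assumptions for illustration, not the exact selectors and objects used in the project.

```r
# Sketch of the RSelenium scraping loop (selectors are illustrative).
library(RSelenium)

driver <- rsDriver(browser = "firefox", verbose = FALSE)
remDr  <- driver$client
remDr$navigate("https://www.ted.com/talks")

scrape_talk <- function(title) {
  # Search for the talk by title and open the first result
  box <- remDr$findElement("css", "input[type='search']")
  box$sendKeysToElement(list(title, key = "enter"))
  Sys.sleep(3)                       # give the dynamic page time to react
  first <- remDr$findElement("css", "a.search-result")
  first$clickElement()
  Sys.sleep(3)
  # Expand the transcript area, then scrape its text
  remDr$findElement("link text", "Read transcript")$clickElement()
  Sys.sleep(2)
  remDr$findElement("css", "div.transcript")$getElementText()[[1]]
}

# Resume after an interruption: scrape only the titles not yet collected
todo <- setdiff(all_titles, done_titles)
transcripts <- vapply(todo, function(t) {
  tryCatch(scrape_talk(t), error = function(e) NA_character_)
}, character(1))

remDr$close(); driver$server$stop()  # close cleanly for future runs
```

The `tryCatch` wrapper mirrors the countermeasures above: a failed locator on one video yields an NA rather than aborting the whole loop, and the run can be restarted from the remaining titles.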

3. Tokenization

Wrangling and parsing data

After scraping the data from the TED.com website, we imported two tables saved in .csv format in the data folder, namely TED.csv and add_details_1.csv. They contain 330 observations of 11 variables and 310 observations of 2 variables, respectively.

We then removed duplicated observations from both tables, joined them on the title column, and named the result TED; it contains 324 observations of 12 variables. However, for the further analyses we focus on just 6 variables of interest: the video title (title), posting date (posted), topic (cate), number of likes (likes), transcript (tanscript), and number of views (views_details); we selected these and dropped the rest. The title variable is used only for the sentiment analysis, so we stored all 6 variables in TED_sentiment, the main table for the sentiment analysis, and then removed title from TED.

We also spotted 34 missing values (NAs) in TED and removed them, leaving 286 observations: 103 videos on AI, 86 on Climate change, and 97 on Relationships.

## [1] 6
##   [1] "How does artificial intelligence learn?"                                                    
##   [2] "The danger of AI is weirder than you think"                                                 
##   [3] "The wonderful and terrifying implications of computers that can learn"                      
##   [4] "How do we find dignity at work?"                                                            
##   [5] "The incredible inventions of intuitive AI"                                                  
##   [6] "How AI can bring on a second Industrial Revolution"                                         
##   [7] "How AI can enhance our memory, work and social lives"                                       
##   [8] "We're building a dystopia just to make people click on ads"                                 
##   [9] "How AI can save our humanity"                                                               
##  [10] "Why fascism is so tempting — and how your data could power it"                              
##  [11] "How AI can help shatter barriers to equality"                                               
##  [12] "A future worth getting excited about"                                                       
##  [13] "What happens in your brain when you pay attention?"                                         
##  [14] "Intelligent floating machines inspired by nature"                                           
##  [15] "What if you could sing in your favorite musician's voice?"                                  
##  [16] "Meet Milo, the virtual boy"                                                                 
##  [17] "Can a computer write poetry?"                                                               
##  [18] "Building \"self-aware\" robots"                                                             
##  [19] "How to get empowered, not overpowered, by AI"                                               
##  [20] "A bold idea to replace politicians"                                                         
##  [21] "The medical potential of AI and metabolites"                                                
##  [22] "An AI smartwatch that detects seizures"                                                     
##  [23] "Art in the age of machine intelligence"                                                     
##  [24] "How do self-driving cars \"see\"?"                                                          
##  [25] "A new equation for intelligence"                                                            
##  [26] "How we'll earn money in a future without jobs"                                              
##  [27] "The ethical dilemma of self-driving cars"                                                   
##  [28] "AI-generated creatures that stretch the boundaries of imagination"                          
##  [29] "A new way to restore Earth's biodiversity — from the air"                                   
##  [30] "Get ready for hybrid thinking"                                                              
##  [31] "Can we build AI without losing control over it?"                                            
##  [32] "A friendly, autonomous robot that delivers your food"                                       
##  [33] "Is humanity smart enough to survive itself?"                                                
##  [34] "How AI could empower any business"                                                          
##  [35] "4 lessons from robots about being human"                                                    
##  [36] "The value of your humanity in an automated future"                                          
##  [37] "How do we learn to work with intelligent machines?"                                         
##  [38] "How AI could become an extension of your mind"                                              
##  [39] "3 myths about the future of work (and why they're not true)"                                
##  [40] "Watson, Jeopardy and me, the obsolete know-it-all"                                          
##  [41] "How bad data keeps us from good AI"                                                         
##  [42] "How we're teaching computers to understand pictures"                                        
##  [43] "Technology that knows what you're feeling"                                                  
##  [44] "Where's Google going next?"                                                                 
##  [45] "How to keep human bias out of AI"                                                           
##  [46] "Machine intelligence makes human morals more important"                                     
##  [47] "The line between life and not-life"                                                         
##  [48] "Robots that fly ... and cooperate"                                                          
##  [49] "How computers are learning to be creative"                                                  
##  [50] "Don't fear superintelligent AI"                                                             
##  [51] "Why we need to imagine different futures"                                                   
##  [52] "How computers learn to recognize objects instantly"                                         
##  [53] "How we can build AI to help humans, not hurt us"                                            
##  [54] "Siri, Alexa, Google ... what comes next?"                                                   
##  [55] "How civilization could destroy itself — and 4 ways we could prevent it"                     
##  [56] "3 principles for creating safer AI"                                                         
##  [57] "The 4 greatest threats to the survival of humanity"                                         
##  [58] "The rise of personal robots"                                                                
##  [59] "Connected, but alone?"                                                                      
##  [60] "Robots will invade our lives"                                                               
##  [61] "An animated tour of the invisible"                                                          
##  [62] "How we can teach computers to make sense of our emotions"                                   
##  [63] "A fascinating time capsule of human feelings toward AI"                                     
##  [64] "How humans and AI can work together to create better businesses"                            
##  [65] "Can machines read your emotions?"                                                           
##  [66] "How brain science will change computing"                                                    
##  [67] "What moral decisions should driverless cars make?"                                          
##  [68] "AI isn't as smart as you think — but it could be"                                           
##  [69] "What happens when our computers get smarter than we are?"                                   
##  [70] "The human skills we need in an unpredictable world"                                         
##  [71] "The jobs we'll lose to machines — and the ones we won't"                                    
##  [72] "Can we learn to talk to sperm whales?"                                                      
##  [73] "A new way to monitor vital signs (that can see through walls)"                              
##  [74] "Why people and AI make good business partners"                                              
##  [75] "What would happen if we upload our brains to computers?"                                    
##  [76] "Robots with \"soul\""                                                                       
##  [77] "How I'm using biological data to tell better stories — and spark social change"             
##  [78] "Silicon-based comedy"                                                                       
##  [79] "The rise of human-computer cooperation"                                                     
##  [80] "The Greek myth of Talos, the first robot"                                                   
##  [81] "A future with fewer cars"                                                                   
##  [82] "What intelligent machines can learn from a school of fish"                                  
##  [83] "How AI is making it easier to diagnose disease"                                             
##  [84] "The real reason for brains"                                                                 
##  [85] "My seven species of robot — and how we created them"                                        
##  [86] "The race to build AI that benefits humanity with Sam Altman"                                
##  [87] "How deepfakes undermine truth and threaten democracy"                                       
##  [88] "Can a robot pass a university entrance exam?"                                               
##  [89] "What AI is — and isn't"                                                                     
##  [90] "Could you recover from illness ... using your own stem cells?"                              
##  [91] "How new technology helps blind people explore the world"                                    
##  [92] "How AI could compose a personalized soundtrack to your life"                                
##  [93] "How to be \"Team Human\" in the digital future"                                             
##  [94] "6 big ethical questions about the future of AI"                                             
##  [95] "What is deep tech? A look at how it could shape the future"                                 
##  [96] "Don't fear intelligent machines. Work with them"                                            
##  [97] "How I'm fighting bias in algorithms"                                                        
##  [98] "Fake videos of real people — and how to spot them"                                          
##  [99] "How we're using AI to discover new antibiotics"                                             
## [100] "A funny look at the unintended consequences of technology"                                  
## [101] "How to get better at video games, according to babies"                                      
## [102] "A sci-fi vision of life in 2041"                                                            
## [103] "We're covered in germs. Let's design for that."                                             
## [104] "Our moral imperative to act on climate change — and 3 steps we can take (English voiceover)"
## [105] "Global warming's theme song, \"Manhattan in January\""                                      
## [106] "A 40-year plan for energy"                                                                  
## [107] "Can the ocean run out of oxygen?"                                                           
## [108] "Whatever happened to acid rain?"                                                            
## [109] "How we'll resurrect the gastric brooding frog, the Tasmanian tiger"                         
## [110] "Fusion is energy's future"                                                                  
## [111] "An urgent call to protect the world's \"Third Pole\""                                       
## [112] "Humanity's planet-shaping powers — and what they mean for the future"                       
## [113] "The energy Africa needs to develop — and fight climate change"                              
## [114] "The secrets I find on the mysterious ocean floor"                                           
## [115] "American bipartisan politics can be saved — here's how"                                     
## [116] "Why wildfires have gotten worse — and what we can do about it"                              
## [117] "A wide-angle view of fragile Earth"                                                         
## [118] "The innovations we need to avoid a climate disaster"                                        
## [119] "What farmers need to be modern, climate-friendly and profitable"                            
## [120] "Why we should archive everything on the planet"                                             
## [121] "How shocking events can spark positive change"                                              
## [122] "Why you should be a climate activist"                                                       
## [123] "This sea creature breathes through its butt"                                                
## [124] "Global priorities bigger than climate change"                                               
## [125] "The \"myth\" of the boiling frog"                                                           
## [126] "Why bees are disappearing"                                                                  
## [127] "Vultures: The acid-puking, plague-busting heroes of the ecosystem"                          
## [128] "How small countries can make a big impact on climate change"                                
## [129] "Why I protest for climate justice"                                                          
## [130] "Why act now?"                                                                               
## [131] "The state of the climate — and what we might do about it"                                   
## [132] "Metal that breathes"                                                                        
## [133] "3 thoughtful ways to conserve water"                                                        
## [134] "Hooked by an octopus"                                                                       
## [135] "Why is the world warming up?"                                                               
## [136] "A reality check on renewables"                                                              
## [137] "How to turn climate anxiety into action"                                                    
## [138] "What comes after An Inconvenient Truth?"                                                    
## [139] "What if cracks in concrete could fix themselves?"                                           
## [140] "A small country with big ideas to get rid of fossil fuels"                                  
## [141] "We need to track the world's water like we track the weather"                               
## [142] "How to shift your mindset and choose your future"                                           
## [143] "Can seaweed help curb global warming?"                                                      
## [144] "What to do when climate change feels unstoppable"                                           
## [145] "The case for optimism on climate change"                                                    
## [146] "The magic of the Amazon: A river that flows invisibly all around us"                        
## [147] "How to transform sinking cities into landscapes that fight floods"                          
## [148] "What nature can teach us about sustainable business"                                        
## [149] "The ocean's ingenious climate solutions"                                                    
## [150] "3 ways your company's data can jump-start climate action"                                   
## [151] "Life in Biosphere 2"                                                                        
## [152] "Emergency medicine for our climate fever"                                                   
## [153] "How to be a good ancestor"                                                                  
## [154] "Community investment is the missing piece of climate action"                                
## [155] "The untapped energy source that could power the planet"                                     
## [156] "It's impossible to have healthy people on a sick planet"                                    
## [157] "How China is (and isn't) fighting pollution and climate change"                             
## [158] "5 transformational policies for a prosperous and sustainable world"                         
## [159] "A new way to remove CO2 from the atmosphere"                                                
## [160] "A bold plan to protect 30 percent of the Earth's surface and ocean floor"                   
## [161] "The \"greenhouse-in-a-box\" empowering farmers in India"                                    
## [162] "Where does all the carbon we release go?"                                                   
## [163] "What if there were 1 trillion more trees?"                                                  
## [164] "The hidden wonders of soil"                                                                 
## [165] "Africa's great carbon valley — and how to end energy poverty"                               
## [166] "The eco-creators helping the climate through social media"                                  
## [167] "How to find joy in climate action"                                                          
## [168] "An interactive map to track (and end) pollution in China"                                   
## [169] "How we can curb climate change by spending two percent more on everything"                  
## [170] "The wonderful world of life in a drop of water"                                             
## [171] "Can clouds buy us more time to solve climate change?"                                       
## [172] "A new economic model for protecting tropical forests "                                      
## [173] "What seaweed and cow burps have to do with climate change"                                  
## [174] "How to make radical climate action the new normal"                                          
## [175] "The 55 gigaton challenge"                                                                   
## [176] "Why don't we cover the desert with solar panels?"                                           
## [177] "The science behind a climate headline"                                                      
## [178] "The race to a zero-emission world starts now"                                               
## [179] "How we can turn the tide on climate"                                                        
## [180] "Plant fuels that could power a jet"                                                         
## [181] "Amazon's climate pledge to be net-zero by 2040"                                             
## [182] "The discoveries awaiting us in the ocean's twilight zone"                                   
## [183] "Ecology from the air"                                                                       
## [184] "How we can detect pretty much anything"                                                     
## [185] "Hopeful lessons from the battle to save rainforests"                                        
## [186] "How we look kilometers below the Antarctic ice sheet"                                       
## [187] "My country will be underwater soon — unless we work together"                               
## [188] "The secret life of plankton"                                                                
## [189] "The big-beaked, rock-munching fish that protect coral reefs"                                
## [190] "Energy from floating algae pods"                                                            
## [191] "Apple's promise to be carbon neutral by 2030"                                               
## [192] "Urbanization and the evolution of cities across 10,000 years"                               
## [193] "Let's scan the whole planet with LiDAR"                                                     
## [194] "Why are blue whales so enormous?"                                                           
## [195] "Why is cotton in everything?"                                                               
## [196] "The Arctic vs. the Antarctic"                                                               
## [197] "The lovable (and lethal) sea lion"                                                          
## [198] "Why I still have hope for coral reefs"                                                      
## [199] "Is the weather actually becoming more extreme?"                                             
## [200] "Climate change is our reality. Here's how we're taking action"                              
## [201] "The biggest risks facing cities — and some solutions"                                       
## [202] "Make your actions on climate reflect your words"                                            
## [203] "Why climate change is a threat to human rights"                                             
## [204] "What a nun can teach a scientist about ecology"                                             
## [205] "How the military fights climate change"                                                     
## [206] "A brief history of divorce"                                                                 
## [207] "Love vs. Honor: The Irish myth of Diarmuid's betrayal"                                      
## [208] "What emotions look like in a dog's brain"                                                   
## [209] "Beautiful new words to describe obscure emotions"                                           
## [210] "Technology hasn't changed love. Here's why"                                                 
## [211] "How reverse mentorship can help create better leaders"                                      
## [212] "\"First Kiss\""                                                                             
## [213] "What you don't know about marriage"                                                         
## [214] "Fifty shades of gay"                                                                        
## [215] "Intimate photos of a senior love triangle"                                                  
## [216] "How understanding divorce can help your marriage"                                           
## [217] "Want to change the world? Start by being brave enough to care"                              
## [218] "Why we love, why we cheat"                                                                  
## [219] "The keys to a happier, healthier sex life"                                                  
## [220] "\"Everything happens for a reason\" — and other lies I've loved"                            
## [221] "4 signs of emotional abuse"                                                                 
## [222] "Say your truths and seek them in others"                                                    
## [223] "How to speak up for yourself"                                                               
## [224] "The truth about faking orgasms"                                                             
## [225] "The therapeutic value of photography"                                                       
## [226] "The myth of the original star-crossed lovers"                                               
## [227] "How to co-parent as allies, not adversaries"                                                
## [228] "Are we designed to be sexual omnivores?"                                                    
## [229] "The science of sex"                                                                         
## [230] "The office without a**holes"                                                                
## [231] "How to support yourself (and others) through grief"                                         
## [232] "5 ways to create stronger connections"                                                      
## [233] "Why US laws must expand beyond the nuclear family"                                          
## [234] "How to avoid catching prickly emotions from other people"                                   
## [235] "The 100 tampons NASA (almost) sent to space — and other absurd songs"                       
## [236] "How to speed up chemical reactions (and get a date)"                                        
## [237] "What makes life worth living in the face of death"                                          
## [238] "The emotions behind your money habits"                                                      
## [239] "How to have constructive conversations"                                                     
## [240] "What makes a friendship last?"                                                              
## [241] "The secret to great opportunities? The person you haven't met yet"                          
## [242] "How to stop swiping and find your person on dating apps"                                    
## [243] "The beauty and complexity of finding common ground"                                         
## [244] "How to discover your \"why\" in difficult times"                                            
## [245] "A second chance for fathers to connect with their kids"                                     
## [246] "Ethical dilemma: Who should you believe?"                                                   
## [247] "How compassion could save your strained relationships"                                      
## [248] "This could be why you're depressed or anxious"                                              
## [249] "How couples can sustain a strong sexual connection for a lifetime"                          
## [250] "There's more to life than being happy"                                                      
## [251] "The science behind how parents affect child development"                                    
## [252] "What young women believe about their own sexual pleasure"                                   
## [253] "The brain in love"                                                                          
## [254] "Why art is a tool for hope"                                                                 
## [255] "The profound power of gratitude and \"living eulogies\""                                    
## [256] "A love story about the power of art as organizing"                                          
## [257] "Is it really that bad to marry my cousin?"                                                  
## [258] "The necessity of normalizing queer love"                                                    
## [259] "The legend of Annapurna, Hindu goddess of nourishment"                                      
## [260] "A sci-fi vision of love from a 318-year-old hologram"                                       
## [261] "The lost art of letter-writing"                                                             
## [262] "Ideas worth dating"                                                                         
## [263] "Rethinking thinking"                                                                        
## [264] "A sex therapist's secret to rediscovering your spark"                                       
## [265] "The uncomplicated truth about women's sexuality"                                            
## [266] "This is what LGBT life is like around the world"                                            
## [267] "Why I photograph the quiet moments of grief and loss"                                       
## [268] "What is love?"                                                                              
## [269] "What almost dying taught me about living"                                                   
## [270] "How to raise kids who can overcome anxiety"                                                 
## [271] "How Dolly Parton led me to an epiphany"                                                     
## [272] "4 kinds of regret — and what they teach you about yourself"                                 
## [273] "What you discover when you really listen"                                                   
## [274] "The money talk that every couple needs to have"                                             
## [275] "3 lessons of revolutionary love in a time of rage"                                          
## [276] "7 common questions about workplace romance"                                                 
## [277] "The relationship between sex and imagination"                                               
## [278] "The journey through loss and grief"                                                         
## [279] "The benefits of not being a jerk to yourself"                                               
## [280] "Why bittersweet emotions underscore life's beauty"                                          
## [281] "How friendship affects your brain"                                                          
## [282] "Rethinking infidelity ... a talk for anyone who has ever loved"                             
## [283] "Why domestic violence victims don't leave"                                                  
## [284] "How peer educators can transform sex education"                                             
## [285] "An ode to envy"                                                                             
## [286] "The myth of Zeus' test"                                                                     
## [287] "\"Accents\""                                                                                
## [288] "The mathematics of love"                                                                    
## [289] "Should you care what your parents think?"                                                   
## [290] "On tennis, love and motherhood"                                                             
## [291] "A little-told tale of sex and sensuality"                                                   
## [292] "Why do we love? A philosophical inquiry"                                                    
## [293] "The difference between healthy and unhealthy love"                                          
## [294] "How to preserve your private life in the age of social media"                               
## [295] "Sex education should start with consent"                                                    
## [296] "This is what enduring love looks like"                                                      
## [297] "What makes a good life? Lessons from the longest study on happiness"                        
## [298] "A queer vision of love and marriage"                                                        
## [299] "The mood-boosting power of crying"                                                          
## [300] "Love others to love yourself"                                                               
## [301] "What we can do about the culture of hate"                                                   
## [302] "The routines, rituals and boundaries we need in stressful times"                            
## [303] "What\xcayou\xcacan\xcalearn\xcafrom\xcapeople\xcawho\xcadisagree\xcawith\xcayou"            
## [304] "You\xcaare\xcanot\xcaalone\xcain\xcayour\xcaloneliness"
Topics           Count
AI                 103
Climate change      86
Relationships       97

We then turned to the data parsing step. We converted the posting time, the number of likes, and the number of views into formats appropriate for the further analyses. For example, the posting time of the first video, “How does artificial intelligence learn?”, was “Mar 2021” in the original TED table; it was converted to 2021-03-01.
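The conversion itself is straightforward. A minimal base-R sketch (the exact scraped formats are assumptions, and an English locale is assumed for the month abbreviation):

```r
# Hedged sketch of the parsing step; the "Mon YYYY" and comma-separated
# count formats are assumptions about the scraped fields.
parse_posted <- function(x) as.Date(paste("01", x), format = "%d %b %Y")
parse_count  <- function(x) as.numeric(gsub("[^0-9]", "", x))

format(parse_posted("Mar 2021"))  # "2021-03-01"
parse_count("2,693,800")          # 2693800
```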

Each transcript began with the number of available translations and the list of translated languages. Since this project focuses only on the actual transcript, we removed this header. For example, the transcript of the first video, “How does artificial intelligence learn?”, began with “Transcript (28 Languages)Bahasa IndonesiaDeutschEnglishEspañolFrançaisItalianoMagyarPolskiPortuguês brasileiroPortuguês de PortugalRomânăTiếng ViệtTürkçeΕλληνικάРусскийСрпски, Srpskiעבריתالعربيةفارسىکوردی سۆرانیবাংলাதமிழ்ภาษาไทยမြန်မာဘာသာ中文 (简体)中文 (繁體)日本語한국어”; this part was stripped before analysis.

We also recoded the topic names AI, Climate change, and Relationships as the numbers 1, 2, and 3, respectively, under the cate variable in TED. This makes it easy to keep track of the videos in the supervised and unsupervised learning analyses.

Due to the limited number of available videos within the selected topics on the TED website, we could not scrape more videos for the unsupervised and supervised learning analyses, yet we still wanted a robust model and to avoid overfitting. We therefore increased the number of observations by treating each window of 20 sentences as one observation, since we noticed that each video's transcript comprises more than 20 sentences.

We split sentences using the tokenize_sentence function from the quanteda package and created a new variable, subcate. For example, a subcate of 1.1 indicates that the observation comes from the first transcript in AI (the first topic). We then created a text variable to identify each text: a text of 1.1.1, for instance, denotes the first 20 sentences of the first transcript in the AI topic. By doing this, the number of observations increased from 286 to 1,471; we named this data frame TED_full. To sum up, TED_full consists of 1,471 observations of 7 variables: posted, cate, like, view, subcate, text, and tanscript.
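The windowing step can be sketched in base R. This is an illustration of the idea, not the project's exact code, and it starts from a transcript that has already been split into sentences:

```r
# Split a vector of sentences into windows of at most 20; each window
# becomes one observation (the last window may be shorter).
window_sentences <- function(sentences, size = 20) {
  unname(split(sentences, ceiling(seq_along(sentences) / size)))
}

chunks <- window_sentences(paste("Sentence", 1:45))
length(chunks)   # 3 observations
lengths(chunks)  # 20 20 5
```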

An example row of the TED_full table
posted cate like view subcate text tanscript
10 2014-12-01 1 80000 2693800 1.3 1.3.4 In fact, deep learning has done more than that. Complex, nuanced sentences like this one are now understandable with deep learning algorithms. As you can see here, this Stanford-based system showing the red dot at the top has figured out that this sentence is expressing negative sentiment. Deep learning now in fact is near human performance at understanding what sentences are about and what it is saying about those things. Also, deep learning has been used to read Chinese, again at about native Chinese speaker level. This algorithm developed out of Switzerland by people, none of whom speak or understand any Chinese. As I say, using deep learning is about the best system in the world for this, even compared to native human understanding. : This is a system that we put together at my company which shows putting all this stuff together. These are pictures which have no text attached, and as I’m typing in here sentences, in real time it’s understanding these pictures and figuring out what they’re about and finding pictures that are similar to the text that I’m writing. So you can see, it’s actually understanding my sentences and actually understanding these pictures. I know that you’ve seen something like this on Google, where you can type in things and it will show you pictures, but actually what it’s doing is it’s searching the webpage for the text. This is very different from actually understanding the images. This is something that computers have only been able to do for the first time in the last few months. : footnotefootnoteSo we can see now that computers can not only see but they can also read, and, of course, we’ve shown that they can understand what they hear. Perhaps not surprising now that I’m going to tell you they can write. Here is some text that I generated using a deep learning algorithm yesterday. And here is some text that an algorithm out of Stanford generated. 
Each of these sentences was generated by a deep learning algorithm to describe each of those pictures. This algorithm before has never seen a man in a black shirt playing a guitar. It’s seen a man before, it’s seen black before, it’s seen a guitar before, but it has independently generated this novel description of this picture. We’re still not quite at human performance here, but we’re close. In tests, humans prefer the computer-generated caption one out of four times.

3. Tokenization

We tokenized our transcripts with the quanteda package in order to obtain a document-term matrix (DTM) and a TF-IDF matrix. We performed the tokenization twice. First, we tokenized TED, which consists of 286 videos/observations, to gain access to the hidden insights of each video and to observe the similarities and dissimilarities between videos. Second, we tokenized TED_full, which consists of 1,471 instances, for the unsupervised and supervised learning analyses.

Tokenization from TED

We applied the corpus() and tokens() functions to the tanscript variable to remove numbers, punctuation, symbols, and separators. We then removed the English stop words of the SMART information retrieval system (571 words) and also deleted 2 more words, applaud and laughter, which appear often in our transcripts as sound representations. Sound representation in a transcript is one of TED's accessibility features, meant to enable deaf and hard-of-hearing viewers to understand the non-spoken auditory information. Afterwards, we performed lemmatization and named the resulting data frame TED.tk1.
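As an illustration of the cleaning steps (the report uses quanteda's corpus() and tokens(); this base-R sketch uses a tiny stand-in stop-word list rather than the 571-word SMART list, and it omits lemmatization):

```r
# Lowercase, strip numbers/punctuation/symbols, split on whitespace, and
# drop stop words plus the sound-representation words.
clean_tokens <- function(txt, stopwords = c("the", "a", "and", "is", "to")) {
  txt  <- tolower(txt)
  txt  <- gsub("[^a-z' ]+", " ", txt)   # drop numbers, punctuation, symbols
  toks <- strsplit(trimws(txt), "\\s+")[[1]]
  toks[!toks %in% c(stopwords, "applaud", "laughter")]
}

clean_tokens("Today, AI helps doctors! (Laughter) 2021 is the year.")
# "today" "ai" "helps" "doctors" "year"
```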

To obtain the document-term matrix and the TF-IDF matrix, we used the dfm() and dfm_tfidf() functions, respectively. The first 10 terms and 10 documents (videos) are shown below. Additionally, the per-term frequencies can simply be obtained with textstat_frequency(), as presented in the last table.

## Corpus consisting of 286 documents, showing 100 documents:
## 
##     Text Types Tokens Sentences
##    text1   349    744        32
##    text2   520   1896        72
##    text3   879   3943       161
##    text4   636   2504        96
##    text5   787   2796       124
##    text6   618   2495       108
##    text7   491   1510        71
##    text8   909   3320       161
##    text9   708   2216        73
##   text10   705   2731       121
##   text11   393    889        28
##   text12  1847  12092       470
##   text13   313    928        37
##   text14   420   1080        41
##   text15   373   1137        31
##   text16   567   1899       100
##   text17   513   2004        86
##   text18   365   1262        57
##   text19   888   3067       102
##   text20   594   2462        94
##   text21   296    759        31
##   text22   781   2796       114
##   text23   664   1698        62
##   text24   381    832        30
##   text25   644   1838        54
##   text26   738   2943        93
##   text27   336    705        26
##   text28   482   1385        50
##   text29   192    338        16
##   text30   619   1905        87
##   text31   738   2571       110
##   text32   528   1482        81
##   text33   891   2824       184
##   text34   585   2088        67
##   text35   743   2858       109
##   text36   669   1856        70
##   text37   593   1610        82
##   text38   519   1575        63
##   text39   802   3103       127
##   text40   991   3974       192
##   text41   514   1372        54
##   text42   824   2616        99
##   text43   663   2099        87
##   text44   844   4577       193
##   text45   608   1975        89
##   text46   909   2933       146
##   text47   672   3171       133
##   text48   725   2682       129
##   text49   813   3451       125
##   text50   630   1855        84
##   text51   861   2616       104
##   text52   398   1233        52
##   text53   486   1540        74
##   text54   538   1479        76
##   text55   964   4322       134
##   text56   798   3339       150
##   text57   359    761        27
##   text58   766   2952       130
##   text59   749   2975       147
##   text60   817   3595       206
##   text61   546   1683       119
##   text62   633   2108        93
##   text63   486   1331        50
##   text64   691   2067       111
##   text65   351    660        31
##   text66   941   5057       297
##   text67   607   2119        74
##   text68   875   3854       118
##   text69   849   2889       107
##   text70   722   2231        79
##   text71   376    831        46
##   text72   483   1089        37
##   text73   562   2314        79
##   text74   483   1089        37
##   text75   572   1796        84
##   text76   572   1796        84
##   text77   681   2618       116
##   text78   739   3401       127
##   text79   497   1095        32
##   text80   394    986        64
##   text81   885   2488       110
##   text82   321    600        25
##   text83   431   1036        43
##   text84   524   1920        58
##   text85   300    841        28
##   text86   986   4554       216
##   text87   950   3541       198
##   text88  1898  13346       635
##   text89   723   1981       107
##   text90   558   1889       100
##   text91  1144   5082       234
##   text92   557   1624        74
##   text93   468   1288        65
##   text94   384   1100        35
##   text95   713   2239       108
##   text96   498   1241        57
##   text97   672   1818        82
##   text98   510   1364        70
##   text99   529   1348        44
##  text100   437   1104        41
An example of the document-feature matrix
doc_id today artificial intelligence help doctor diagnose patient pilot fly commercial
text1 1 3 2 1 6 4 11 1 1 1
text2 0 2 2 0 0 0 0 0 0 0
text3 1 0 0 0 2 0 0 0 0 1
text4 0 1 1 0 0 0 0 0 0 0
text5 1 0 0 0 0 0 0 0 2 0
text6 3 12 10 0 2 1 0 1 2 0
text7 3 3 7 4 1 1 0 0 0 0
text8 1 8 10 0 0 0 0 0 0 0
text9 2 2 2 5 0 1 0 0 0 0
text10 3 2 2 0 0 0 0 0 0 0
An example of the TF-IDF matrix
doc_id today artificial intelligence help doctor diagnose patient pilot fly commercial
text1 0.2716746 1.8525508 1.1980671 0.4520447 5.0614931 4.378553 10.61505 1.225917 0.8129134 1.280275
text2 0.0000000 1.2350339 1.1980671 0.0000000 0.0000000 0.000000 0.00000 0.000000 0.0000000 0.000000
text3 0.2716746 0.0000000 0.0000000 0.0000000 1.6871644 0.000000 0.00000 0.000000 0.0000000 1.280275
text4 0.0000000 0.6175169 0.5990335 0.0000000 0.0000000 0.000000 0.00000 0.000000 0.0000000 0.000000
text5 0.2716746 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 0.00000 0.000000 1.6258267 0.000000
text6 0.8150238 7.4102033 5.9903354 0.0000000 1.6871644 1.094638 0.00000 1.225917 1.6258267 0.000000
text7 0.8150238 1.8525508 4.1932348 1.8081786 0.8435822 1.094638 0.00000 0.000000 0.0000000 0.000000
text8 0.2716746 4.9401355 5.9903354 0.0000000 0.0000000 0.000000 0.00000 0.000000 0.0000000 0.000000
text9 0.5433492 1.2350339 1.1980671 2.2602233 0.0000000 1.094638 0.00000 0.000000 0.0000000 0.000000
text10 0.8150238 1.2350339 1.1980671 0.0000000 0.0000000 0.000000 0.00000 0.000000 0.0000000 0.000000
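For reference, the weighting behind dfm_tfidf() with default settings is, to our understanding, count × log10(N / docfreq). A small sketch (the docfreq of 153 for “today” is inferred from the numbers above, not shown in the output):

```r
# Assumed default quanteda weighting: tf = raw count, idf = log10(N / df).
tfidf <- function(count, docfreq, ndoc) count * log10(ndoc / docfreq)

# With 286 documents and an inferred docfreq of 153, this reproduces the
# weight of "today" in text1 above:
round(tfidf(count = 1, docfreq = 153, ndoc = 286), 4)  # 0.2717
```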
##    feature frequency rank docfreq group
## 1   people      1992    1     239   all
## 2     make      1616    2     269   all
## 3    thing      1505    3     234   all
## 4     time      1290    4     258   all
## 5     year      1286    5     253   all
## 6     work      1175    6     238   all
## 7    human      1071    7     191   all
## 8    world       940    8     216   all
## 9     life       796    9     207   all
## 10    love       780   10     148   all

Tokenization from TED_full for unsupervised and supervised learning analyses

We again use the quanteda package to create the DTM and TF-IDF tables, this time from TED_full.

## Corpus consisting of 1471 documents, showing 100 documents:
## 
##     Text Types Tokens Sentences
##    text1   240    453        20
##    text2   182    291        12
##    text3   222    499        20
##    text4   232    555        20
##    text5   221    550        20
##    text6   130    292        12
##    text7   238    551        20
##    text8   246    526        20
##    text9   217    476        20
##   text10   192    451        20
##   text11   258    579        20
##   text12   207    548        20
##   text13   184    367        20
##   text14   210    442        20
##   text15     3      3         1
##   text16   255    562        20
##   text17   204    477        20
##   text18   236    566        20
##   text19   188    468        20
##   text20   195    431        16
##   text21   165    328        20
##   text22   226    462        20
##   text23   213    409        20
##   text24   235    459        20
##   text25   258    563        20
##   text26   224    515        20
##   text27    48     60         4
##   text28   200    435        20
##   text29   220    555        20
##   text30   235    555        20
##   text31   183    468        20
##   text32   150    335        20
##   text33    79    147         8
##   text34   191    387        20
##   text35   156    313        20
##   text36   206    427        20
##   text37   182    383        11
##   text38   190    337        20
##   text39   164    333        20
##   text40   197    384        20
##   text41   232    483        20
##   text42   227    469        20
##   text43   153    305        20
##   text44   268    555        20
##   text45   199    380        20
##   text46    51     74         1
##   text47   297    681        20
##   text48   287    600        20
##   text49   275    608        20
##   text50   156    327        13
##   text51   182    388        20
##   text52   169    394        20
##   text53   166    387        20
##   text54   247    586        20
##   text55   196    428        20
##   text56   251    533        20
##   text57    13     15         1
##   text58   310    606        20
##   text59   155    283         8
##   text60   178    399        20
##   text61   205    499        20
##   text62   197    479        20
##   text63   239    564        20
##   text64   229    544        20
##   text65   264    629        20
##   text66   235    556        20
##   text67   249    571        20
##   text68   203    522        20
##   text69   222    540        20
##   text70   218    469        20
##   text71   248    598        20
##   text72   181    407        20
##   text73   173    342        20
##   text74   184    407        20
##   text75   260    639        20
##   text76   196    492        20
##   text77   251    588        20
##   text78   197    430        20
##   text79   245    597        20
##   text80   233    584        20
##   text81   226    520        20
##   text82   216    522        20
##   text83   106    194        10
##   text84   197    495        20
##   text85   204    433        17
##   text86   285    601        20
##   text87   225    445        20
##   text88    22     34         1
##   text89   242    609        20
##   text90   204    528        11
##   text91   218    450        20
##   text92   207    403        20
##   text93   133    253        20
##   text94   179    379        20
##   text95   214    414        20
##   text96   168    327        20
##   text97   180    463        20
##   text98   216    515        20
##   text99   199    543        20
##  text100    85    156         6
## Tokens consisting of 1,471 documents.
## text1 :
##  [1] "today"        "artificial"   "intelligence" "help"         "doctor"      
##  [6] "diagnose"     "patient"      "pilot"        "fly"          "commercial"  
## [11] "aircraft"     "city"        
## [ ... and 192 more ]
## 
## text2 :
##  [1] "treatment"  "progress"   "program"    "receive"    "feedback"  
##  [6] "constantly" "update"     "plan"       "patient"    "technique" 
## [11] "inherently" "smart"     
## [ ... and 109 more ]
## 
## text3 :
##  [1] "artificial"   "intelligence" "disrupt"      "kind"         "industry"    
##  [6] "ice"          "cream"        "kind"         "mind-blowing" "flavor"      
## [11] "generate"     "power"       
## [ ... and 124 more ]
## 
## text4 :
##  [1] "turn"        "ai"          "solve"       "problem"     "assemble"   
##  [6] "tower"       "fall"        "land"        "point"       "technically"
## [11] "solve"       "problem"    
## [ ... and 135 more ]
## 
## text5 :
##  [1] "think"       "nice"        "paint"       "color"       "name"       
##  [6] "imitate"     "kind"        "letter"      "combination" "original"   
## [11] "word"        "word"       
## [ ... and 149 more ]
## 
## text6 :
##  [1] "ai"           "suppose"      "copy"         "thing"        "human"       
##  [6] "technically"  "ask"          "accidentally" "ask"          "wrong"       
## [11] "thing"        "time"        
## [ ... and 68 more ]
## 
## [ reached max_ndoc ... 1,465 more documents ]

DTM

## Document-feature matrix of: 6 documents, 15,045 features (99.39% sparse) and 0 docvars.
##        features
## docs    today artificial intelligence help doctor diagnose patient pilot fly
##   text1     1          2            2    1      6        4       8     1   1
##   text2     0          1            0    0      0        0       3     0   0
##   text3     0          2            2    0      0        0       0     0   0
##   text4     0          0            0    0      0        0       0     0   0
##   text5     0          0            0    0      0        0       0     0   0
##   text6     0          0            0    0      0        0       0     0   0
##        features
## docs    commercial
##   text1          1
##   text2          0
##   text3          0
##   text4          0
##   text5          0
##   text6          0
## [ reached max_nfeat ... 15,035 more features ]
##    feature frequency rank docfreq group
## 1   people      1992    1     822   all
## 2     make      1616    2     858   all
## 3    thing      1505    3     760   all
## 4     time      1290    4     774   all
## 5     year      1286    5     684   all
## 6     work      1175    6     648   all
## 7    human      1071    7     478   all
## 8    world       940    8     547   all
## 9     life       796    9     447   all
## 10    love       780   10     325   all
## 11   start       721   11     457   all
## 12    feel       693   12     389   all
## 13    talk       689   13     441   all
## 14      ai       681   14     179   all
## 15    call       653   15     447   all
## 16  change       636   16     375   all
## 17    find       631   17     418   all
## 18     lot       629   18     424   all
## 19    kind       624   19     405   all
## 20   learn       618   20     325   all

TF-IDF

##   regret  divorce    sekou     cell   cousin  vulture   farmer      chk 
## 41.06482 38.19748 35.50333 35.45800 34.83772 34.39899 34.10429 31.67613 
##   holmes    cloud 
## 31.67613 30.69386

4. EDA

In this section we:
- analyse word frequencies: compute and show frequencies and TF-IDF;
- compare the talks in terms of lexical diversity;
- compare talks in terms of their key words (keyness);
- show links between terms: compute co-occurrences and build a network.

Plot of Frequencies and TF-IDF

Because we have 3 different topics, we do not expect topic-specific words among the most frequently used terms. We can nevertheless see some words related to our topics, such as love, ai, kind, world, and human.

Text 12 (EM = Elon Musk, CA = Chris Anderson): both terms have a high TF-IDF because they are specific to this text.

This links to the number of samples we have per cate: the Relationships videos contribute the most observations.

Lexical Diversity

TTR: the lexical diversity analysis is conclusive for these data. We can see that text237 and text131 have the highest vocabulary richness among the documents.
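The type-token ratio behind this comparison is simply the number of distinct terms divided by the total number of tokens; a one-line sketch:

```r
# TTR: lexical diversity = unique types / total tokens.
ttr <- function(tokens) length(unique(tokens)) / length(tokens)

ttr(c("love", "is", "love"))  # 2/3, about 0.667
```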


Keyness


AI: the keyness of terms in text1, compared to all the other texts, includes unsupervised, supervised, patient, and treatment.

5. Sentiment analysis

In this part, we use two dictionaries, AFINN and NRC, as well as the valence-shifter method, to run sentiment analysis on every video's full transcript; that is, we do not split the transcripts into 20-sentence windows here. This lets us see whether the sentiment of a video influences its other features. We have the following hypotheses:

  • The sentiment of a video is related to its topic; a topic like AI may carry more positive sentiment.
  • The sentiment of a video influences users giving likes; positive videos usually have more likes.
  • The sentiment of videos has changed over the years, since it may be related to contemporary affairs.

5.1 Sentiment-Based

First, we use the NRC method to obtain a sentiment description of each video's transcript. As this is a sentiment-based method, we only check the relationship with the videos' topics and their likes.

5.1.1 Sentiment vs. Likes

Since there are close to 300 transcripts (videos), we extract the 20 videos with the most likes and the 20 videos with the fewest likes.

Re-scaling the sentiment by document length:

In this part, we applied the NRC method in two different ways: once without scaling, and once re-scaling the sentiment counts by the length of the document.
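A minimal sketch of the re-scaled variant, using a tiny stand-in lexicon rather than the real NRC dictionary:

```r
# Count emotion hits per document, then divide by document length.
lexicon <- c(great = "positive", hope = "anticipation", fear = "fear")

score_nrc <- function(tokens) {
  hits <- table(lexicon[tokens[tokens %in% names(lexicon)]])
  hits / length(tokens)  # re-scale by document length
}

score_nrc(c("great", "hope", "great", "talk", "today"))
# anticipation 0.2, positive 0.4
```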

In both cases, there is no obvious difference between the most-liked and least-liked videos in terms of sentiment: positive and anticipation appear everywhere. In some of the top-20 videos, we also see negative and fear sentiment at relatively high levels.

5.1.2 Sentiment vs. Topics

We check which sentiments appear most frequently in each topic. We expect a topic like climate change to be more related to negative or fear sentiment, and a topic like AI to show anticipation or positive sentiment more frequently.

As assumed, the topic of AI is often accompanied by positive and anticipation sentiment, and trust cannot be ignored; yet negative also accounts for a non-negligible part. Contrary to our speculation, positive is likewise the most frequent sentiment in the climate change topic. In the videos on Relationships, the sentiments are more evenly distributed, even though positive is still the most frequent.

Based on this analysis, we begin to assume that positive is in fact the dominant sentiment across all TED talk videos.

5.2 Value-Based

Thus, beyond the initial assumption, we check one more: whether positive sentiment appears in all videos, using the value-based AFINN method.

Here, we calculated the average sentiment score per video. The numbers of transcripts with positive and with negative average values are very disparate: TED does seem to favour positive videos.
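A sketch of the value-based scoring, with toy valence values standing in for the real AFINN lexicon:

```r
# Average the valence of matched tokens per document (0 if none match).
afinn_toy <- c(love = 3, great = 3, problem = -2, crisis = -3)

avg_sentiment <- function(tokens) {
  vals <- afinn_toy[tokens[tokens %in% names(afinn_toy)]]
  if (length(vals) == 0) return(0)
  mean(vals)
}

avg_sentiment(c("love", "crisis", "talk"))  # (3 - 3) / 2 = 0
```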

5.2.1 Sentiment vs. Likes

Since the number of likes is fairly similar across videos, with only a few individual videos receiving very many likes, we separate the categories to observe the distribution of likes and sentiment values. Again, there is no obvious pattern; only some videos on the climate change topic combine negative sentiment with fewer likes.

5.2.2 Sentiment vs. Topics

The sentiment values for each topic are relatively similar, and all lie in the upper-middle, more positive, range. Within the climate change and AI topics, the per-video sentiment values are more evenly distributed; the AI topic has two outliers with the lowest, most negative, values.

5.2.3 Sentiment over years

The talks on Relationships topics fluctuate greatly: their mean sentiment value reached its lowest points before 2005 and around 2013. Climate change talks have seen a decline in mean sentiment value in recent years. It is worth noting, however, that videos posted after 2020 show increasingly positive values.

5.3 Using Valence-Shifters

At the same time, we would like to check whether there is a large difference after applying valence shifters.

First, the sentiment values are distributed similarly to those obtained without valence shifters. Counting the transcripts with negative average values, we find 31 videos before taking negation into account, and 19 videos after.
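A toy sketch of the negation effect (real valence-shifter handling, e.g. in the sentimentr package, is considerably more elaborate):

```r
# A negator directly before a sentiment word flips its sign.
vals_toy <- c(good = 3, bad = -3)

shifted_score <- function(tokens, negators = c("not", "never")) {
  vals <- vals_toy[tokens]
  vals[is.na(vals)] <- 0
  negated <- c(FALSE, head(tokens, -1) %in% negators)
  sum(ifelse(negated, -vals, vals))
}

shifted_score(c("not", "bad"))   # +3: "not bad" reads as positive
shifted_score(c("bad", "talk"))  # -3
```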

LSA

LSA on TF

First, we build the LSA object with 4 dimensions. Latent Semantic Analysis (LSA) decomposes the DTM (TED.dfm) into 3 matrices (\(M = U\Sigma V^{t}\)), centred around 4 topics. We inspect the 3 matrices: U (document-topic similarity), Σ (topic strength), and V (term-topic similarity).
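As an aside, the decomposition can be sketched with base R's svd() on a toy DTM (illustrative values only; the report's LSA runs on TED.dfm):

```r
# Toy 4-document, 3-term DTM for illustration.
dtm <- matrix(c(2, 0, 1,
                0, 3, 1,
                1, 1, 0,
                0, 2, 2),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("text", 1:4), c("ai", "love", "climate")))

k <- 2                   # number of latent topics kept
s <- svd(dtm, nu = k, nv = k)
U     <- s$u             # document-topic similarities
Sigma <- s$d[1:k]        # topic strengths, in decreasing order
V     <- s$v             # term-topic similarities
```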

##              [,1]       [,2]         [,3]         [,4]
## text1 -0.02798334 0.05617722 -0.006232811  0.001081462
## text2 -0.02150955 0.04634971 -0.001429383 -0.010344754
## text3 -0.03070206 0.08594805  0.011743288 -0.008303698
## text4 -0.03807238 0.11359573  0.010462918 -0.030762145
## text5 -0.03562075 0.08850081  0.010682370 -0.066064393
## text6 -0.02464409 0.08177654  0.010806598 -0.053295403

This document-topic table shows the link between each text and each topic. For example, text1 is most strongly associated with dimension 2 (topic 2).

## [1] 183.84893  85.97943  81.27583  73.82645

These values represent the strength of each topic. Setting aside topic 1, topic 2 has the largest strength.

##                      [,1]        [,2]         [,3]         [,4]
## today        -0.056996586 0.012678495 -0.048550264 -0.015598272
## artificial   -0.030316542 0.068378034 -0.002665452 -0.003973498
## intelligence -0.051675811 0.132033185  0.003634968 -0.010273700
## help         -0.025376025 0.002096622  0.003236197 -0.012198151
## doctor       -0.013845930 0.003662971  0.007149488 -0.004113176
## diagnose     -0.005836709 0.009556468  0.002316538 -0.003080563

This term-topic table shows the link between each term and each topic. For example, the term “artificial” is most strongly associated with dimension 2 (topic 2).

The first dimension of LSA is often correlated with document length and term frequency, and thus carries little information (it is often not represented). We build a scatter plot of document length against dimension 1 to demonstrate this.

Then we check the top words for dimensions 2, 3, and 4.

##         ai      human      robot    machine      datum       feel    climate 
##  0.5115474  0.3948890  0.1953350  0.1781869  0.1514245 -0.1080379 -0.1133600 
##       life       love     people 
## -0.1259553 -0.2119132 -0.2781561

Dimension 2 is positively associated with words like “ai”, “human”, “robot”, “machine”, and “datum”, and negatively associated with “feel”, “climate”, “life”, “love”, and “people”.

##     people       love      robot       feel       life     forest       year 
##  0.2709439  0.2628304  0.1889753  0.1307637  0.1043617 -0.1520730 -0.1816735 
##     energy     carbon    climate 
## -0.1888159 -0.1949385 -0.2867198

Dimension 3 is positively associated with words like “people”, “love”, “robot”, “feel”, and “life”, and negatively associated with “forest”, “year”, “energy”, “carbon”, and “climate”.

##       robot       thing        rule        move       start       datum 
##  0.77146585  0.12652233  0.10098575  0.09324200  0.07307601 -0.09207113 
##       human        love      people          ai 
## -0.12819844 -0.13269881 -0.19402622 -0.35842629

Dimension 4 is positively associated with words such as “robot”, “thing”, “rule”, “move”, and “start”, and negatively associated with “datum”, “human”, “love”, “people”, and “ai”.

To check the relation between the LSA and the category of each text, we combine the LSA result with the document categories and represent every text on the two following plots. In the first plot, the x-axis is dimension 2 and the y-axis is dimension 3. According to this plot, most texts in the category “Climate change” are negatively associated with dimension 3, most texts in the category “Relationships” are positively associated with dimension 3, and most texts in the category “AI” are positively associated with dimension 2.

In the second plot, the x-axis is dimension 3 and the y-axis is dimension 4. According to this plot, most texts in the category “AI” are positively associated with dimension 4, most texts in the category “Climate change” are negatively associated with dimension 3, and most texts in the category “Relationships” are positively associated with dimension 3.

LSA on TF-IDF

We repeat the LSA using the TF-IDF-weighted DTM, to check whether the weighted frequencies make the LSA results easier to interpret.
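TF-IDF down-weights terms that appear in most documents and up-weights rarer, more discriminative ones. A base-R sketch of the weighting, on a hypothetical toy DTM (quanteda's `dfm_tfidf` applies the same idea at scale, with its own default weighting scheme):

```r
# A toy count DTM; in the report this is the full TED document-term matrix.
dtm <- matrix(c(4, 2, 0,
                0, 3, 1,
                0, 1, 5),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("text", 1:3), c("ai", "people", "climate")))

tf    <- dtm / rowSums(dtm)                 # relative term frequency per document
idf   <- log(nrow(dtm) / colSums(dtm > 0))  # rarer terms get larger weights
tfidf <- sweep(tf, 2, idf, `*`)             # TF-IDF = TF scaled column-wise by IDF
```

Note that “people”, which occurs in every document, gets an IDF of 0 and therefore contributes nothing after weighting, which is exactly the intended effect.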

##              [,1]        [,2]        [,3]        [,4]
## text1 -0.04224960 -0.03292790 -0.03208569 -0.03944124
## text2 -0.02756965 -0.02249735 -0.02491866 -0.03087374
## text3 -0.03527439 -0.03943897 -0.05735032 -0.03177687
## text4 -0.03879397 -0.04409469 -0.06904558 -0.04798353
## text5 -0.04257262 -0.03371239 -0.04569031 -0.07253177
## text6 -0.02134126 -0.02747207 -0.03317418 -0.05998222
## [1] 148.71774  92.26676  83.34253  79.22164
##                     [,1]         [,2]         [,3]         [,4]
## today        -0.05366875  0.009569134 -0.006827088 -0.024128559
## artificial   -0.04159682 -0.037865030 -0.048246202 -0.048205992
## intelligence -0.06458133 -0.067969303 -0.077966876 -0.073515045
## help         -0.03272158 -0.008027673  0.011084504 -0.007494126
## doctor       -0.02390332 -0.018771576  0.008266465 -0.011336584
## diagnose     -0.01334579 -0.014180559 -0.007582013 -0.016581481

We also check the top words for dimensions 2, 3, and 4 of the LSA on TF-IDF.

##      forest      carbon     climate    emission      energy       human 
##  0.21762909  0.20846727  0.17702682  0.17680258  0.14348312 -0.07828712 
##    computer     machine          ai       robot 
## -0.08208328 -0.08284244 -0.16431673 -0.28294361

For this LSA, dimension 2 is positively associated with words such as “forest”, “carbon”, “climate”, “emission”, and “energy”, and negatively associated with “human”, “computer”, “machine”, “ai”, and “robot”.

##      regret         sex       woman        love         man       datum 
##  0.25686981  0.23120913  0.16005250  0.15399852  0.11559716 -0.07911780 
##     machine        rule          ai       robot 
## -0.08268746 -0.08726087 -0.25484305 -0.44130868

Dimension 3 is positively associated with words such as “regret”, “sex”, “woman”, “love”, and “man”, and negatively associated with “datum”, “machine”, “rule”, “ai”, and “robot”.

##       robot        rule         bee     seaweed       coral     machine 
##  0.62140539  0.13220627  0.12473378  0.10947843  0.10684442 -0.07824845 
##       human     company       datum          ai 
## -0.08784392 -0.10995016 -0.14675300 -0.40634094

Dimension 4 is positively associated with words such as “robot”, “rule”, “bee”, “seaweed”, and “coral”, and negatively associated with “machine”, “human”, “company”, “datum”, and “ai”.

We also check the relation between this LSA result and the category of each text: we combine the LSA result with the document categories and represent every text on the two following plots. In the first plot, the x-axis is dimension 2 and the y-axis is dimension 3. According to this plot, most texts in the category “Climate change” are positively associated with dimension 2, most texts in the category “Relationships” are positively associated with dimension 3, and most texts in the category “AI” are negatively associated with both dimension 2 and dimension 3.

In the second plot, the x-axis is dimension 3 and the y-axis is dimension 4. According to this plot, texts in the category “AI” are associated with dimension 4, but some are negatively and some positively associated with it, so the pattern is not very clear.

LDA

We now turn to Latent Dirichlet Allocation (LDA). LDA is a Bayesian generative model for topic modeling; like LSA, it discovers topics in a collection of documents. For the illustration, we again build 4 topics.
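The report fits LDA with an R topic-modeling package; to show what the model actually estimates, here is a minimal, self-contained collapsed Gibbs sampler for LDA in base R on a hypothetical toy corpus (all data and hyperparameters below are illustrative, not the report's):

```r
set.seed(42)

# Toy corpus: each document is a vector of word indices into `vocab`.
vocab <- c("climate", "energy", "robot", "ai", "love", "life")
docs  <- list(c(1, 2, 1, 2), c(3, 4, 3, 4), c(5, 6, 5, 6), c(1, 2, 2, 1))

K     <- 2      # number of topics
alpha <- 0.1    # document-topic Dirichlet prior
beta  <- 0.1    # topic-word Dirichlet prior
V     <- length(vocab)
D     <- length(docs)

# Random initial topic assignment for every token, plus the count tables.
z   <- lapply(docs, function(d) sample(K, length(d), replace = TRUE))
ndk <- matrix(0, D, K)   # document-topic counts
nkw <- matrix(0, K, V)   # topic-word counts
nk  <- numeric(K)        # tokens per topic
for (d in seq_len(D)) for (i in seq_along(docs[[d]])) {
  k <- z[[d]][i]; w <- docs[[d]][i]
  ndk[d, k] <- ndk[d, k] + 1; nkw[k, w] <- nkw[k, w] + 1; nk[k] <- nk[k] + 1
}

# Collapsed Gibbs sampling: resample each token's topic from its conditional.
for (iter in 1:200) {
  for (d in seq_len(D)) for (i in seq_along(docs[[d]])) {
    k <- z[[d]][i]; w <- docs[[d]][i]
    ndk[d, k] <- ndk[d, k] - 1; nkw[k, w] <- nkw[k, w] - 1; nk[k] <- nk[k] - 1
    p <- (ndk[d, ] + alpha) * (nkw[, w] + beta) / (nk + V * beta)
    k <- sample(K, 1, prob = p)
    z[[d]][i] <- k
    ndk[d, k] <- ndk[d, k] + 1; nkw[k, w] <- nkw[k, w] + 1; nk[k] <- nk[k] + 1
  }
}

# Posterior estimates: document-topic (theta) and topic-word (phi) distributions.
theta <- (ndk + alpha) / rowSums(ndk + alpha)
phi   <- (nkw + beta)  / rowSums(nkw + beta)
```

The `phi` rows correspond to the per-topic top-word lists shown below, and `theta` is what assigns each document to its most likely topic.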

First, we check the top 5 words in each topic. For example, the top 5 terms for topic 1 are “climate”, “year”, “make”, “change”, and “energy”.

##      Topic 1   Topic 2  Topic 3  Topic 4
## [1,] "climate" "people" "people" "robot"
## [2,] "year"    "human"  "love"   "thing"
## [3,] "make"    "ai"     "feel"   "time" 
## [4,] "change"  "make"   "life"   "make" 
## [5,] "energy"  "thing"  "thing"  "brain"

We then create a table showing the number of documents assigned to each topic. For example, topic 3 has the highest number of documents (439).

## .
##   1   2   3   4 
## 316 395 439 321

We then use the topic_diagnostics function to diagnose the prominence, coherence, and exclusivity of each topic.

##   topic_num topic_size mean_token_length dist_from_corpus tf_df_dist
## 1         1   3383.187               5.6        0.4216740   19.70057
## 2         2   3644.634               5.1        0.3793068   22.85747
## 3         3   4124.771               5.2        0.3999282   21.41908
## 4         4   3892.408               4.4        0.4113173   21.01260
##   doc_prominence topic_coherence topic_exclusivity
## 1            418       -86.57526          8.904578
## 2            541       -73.97054          8.708857
## 3            574       -77.63556          8.381961
## 4            435       -67.67438          8.289089

Topic 3 has the largest prominence. Topic 4 has the largest topic coherence, and Topic 1 the smallest. Topic 1 has the largest topic exclusivity, and Topic 4 the smallest.

Topic 1 focuses on terms such as “climate”, “change”, “energy”, and “water”. Topic 2 focuses on terms such as “people”, “ai”, “work”, and “technology”. Topic 3 focuses on terms such as “love”, “life”, “woman”, and “relationship”. Topic 4 focuses on terms such as “robot”, “thing”, “brain”, and “human”.

The climate-change-related documents mainly concern Topic 1, the relationships-related documents mainly concern Topic 3, and the AI-related documents mainly concern Topics 2 and 4.

Embedding

Besides LSA and LDA, we also want to use embeddings to analyze the TED video transcripts. Embedding refers to the representation of elements (documents or tokens) in a vector space model. First we build a word embedding, and then a document embedding that inherits the word co-occurrence properties.

Word Embedding

The objective is to find a word embedding that reflects the co-occurrences. We use the fcm function from quanteda to compute the word co-occurrences.

## Feature co-occurrence matrix of: 6 by 15,045 features.
##               features
## features       today artificial intelligence diagnose fly commercial aircraft
##   today           28          8            8        4   5          0        1
##   artificial       8         30          170        2   2          0        0
##   intelligence     8        170           78        1   0          0        0
##   diagnose         4          2            1        0   1          1        1
##   fly              5          2            0        1   8          2        5
##   commercial       0          0            0        1   2          0        3
##               features
## features       city predict traffic
##   today           5       2       0
##   artificial      0       3       0
##   intelligence    0       6       0
##   diagnose        1       1       0
##   fly             1       1       0
##   commercial      1       1       1
## [ reached max_nfeat ... 15,035 more features ]

Here we show the first several rows of the result. For example, the co-occurrence count of the words “artificial” and “intelligence” is large (170), whereas the count for “fly” and “intelligence” is zero. We then fit the word embedding on this co-occurrence matrix; the log below shows the training loss decreasing over 10 epochs.
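Conceptually, `fcm` slides a context window over each token stream and counts which words fall inside each other's windows. A base-R sketch of that counting on a hypothetical token vector (quanteda does this efficiently at corpus scale):

```r
# Window-based co-occurrence counting, the idea behind quanteda::fcm.
cooc <- function(tokens, window = 2) {
  vocab <- unique(tokens)
  m <- matrix(0, length(vocab), length(vocab), dimnames = list(vocab, vocab))
  for (i in seq_along(tokens)) {
    lo <- max(1, i - window)
    hi <- min(length(tokens), i + window)
    for (j in setdiff(lo:hi, i))   # every other token inside the window
      m[tokens[i], tokens[j]] <- m[tokens[i], tokens[j]] + 1
  }
  m
}

# Hypothetical token stream for illustration.
toks <- c("artificial", "intelligence", "help", "doctor",
          "artificial", "intelligence")
cooc(toks)["artificial", "intelligence"]
```

Because every pair within a window is counted from both sides, the resulting matrix is symmetric.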

## INFO  [17:17:44.311] epoch 1, loss 0.0445
## INFO  [17:17:44.390] epoch 2, loss 0.0367
## INFO  [17:17:44.459] epoch 3, loss 0.0343
## INFO  [17:17:44.526] epoch 4, loss 0.0330
## INFO  [17:17:44.593] epoch 5, loss 0.0322
## INFO  [17:17:44.661] epoch 6, loss 0.0318
## INFO  [17:17:44.727] epoch 7, loss 0.0314
## INFO  [17:17:44.793] epoch 8, loss 0.0312
## INFO  [17:17:44.859] epoch 9, loss 0.0310
## INFO  [17:17:44.926] epoch 10, loss 0.0309

For the visualization, we draw two plots. The first plots the vectors of the 100 most frequent words; the top 6 are shown below as an example. The second plots all the words but labels only a subset of them.

## [1] "people" "make"   "thing"  "time"   "year"   "work"

Words that are close on the map are often used together. For example, “machine” and “intelligence” are close, meaning they are usually used together; the same holds for “man” and “woman”.

The plot shows all the used words in grey and labels a subset for illustration. According to the plot, “warm” and “temperature” are close, and so are “cat” and “bird”, which means each pair is usually used together.
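“Closeness” on the map can be quantified directly as the cosine similarity between word vectors; words used in similar contexts get similarity near 1. A base-R sketch with hypothetical 2-d vectors (the real vectors come from the fitted embedding):

```r
# Cosine similarity between two vectors.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Hypothetical 2-d word vectors for illustration only.
vec <- rbind(machine      = c(1.2, 0.8),
             intelligence = c(1.0, 0.9),
             love         = c(-0.9, 1.1))

cosine(vec["machine", ], vec["intelligence", ])  # high: similar contexts
cosine(vec["machine", ], vec["love", ])          # low: dissimilar contexts
```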

Document Embedding

We now build the document embedding by computing, for each document, the centroid of its word vectors.
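The centroid computation itself is just a per-document average of word vectors. A base-R sketch with hypothetical vectors and token lists (names chosen to echo the output below, not taken from the real fit):

```r
# Hypothetical word vectors (rows) in a 2-d embedding space.
word_vec <- rbind(today        = c(-0.79, 2.27),
                  artificial   = c( 1.01, 1.35),
                  intelligence = c( 1.46, 1.77))

# Hypothetical token lists per document.
doc_tokens <- list(text1 = c("today", "artificial", "intelligence"),
                   text2 = c("artificial", "intelligence"))

# Each document vector is the column-wise mean of its words' vectors.
doc_vec <- t(sapply(doc_tokens,
                    function(tok) colMeans(word_vec[tok, , drop = FALSE])))
```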

## [1] "today"        "artificial"   "intelligence" "help"         "doctor"      
## [6] "diagnose"
##                    [,1]        [,2]
## today        -0.7912934  2.26962581
## artificial    1.0067433  1.35437591
## intelligence  1.4609250  1.77026322
## help          0.2675278  1.23958209
## doctor       -0.1246391  0.61783399
## diagnose     -0.6098596 -0.02239948
## [1] 0.2709277 0.7512324
##            [,1]      [,2]
## text1 0.2709277 0.7512324
## text2 0.4151799 0.9684588
## text3 0.2466364 1.1242400
## text4 0.2350248 1.2377504
## text5 0.3214014 1.0811997
## text6 0.4063020 1.3325672

Now we represent the documents on a plot, using a different color for each category. According to this plot, the documents in the categories “Climate change” and “Relationships” overlap the most, so the documents in these two categories may be more similar to each other than either is to the documents in the category “AI”.

LSA and LDA

Supervised learning (LSA on TF)

First, we use LSA on TF to obtain a reduced-dimension version of the DFM, and build a random forest model to predict the category from the LSA scores.

We build a data frame consisting of the category and the “doc” matrix of the LSA. Along with this, we build the training-set index based on an 80/20 split.
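The split can be sketched in base R as follows (the dummy `df` stands in for the real category-plus-LSA-scores data frame, so the snippet is self-contained):

```r
set.seed(123)

# Dummy stand-in for the real data frame: category + 4 LSA score columns.
df <- data.frame(category = factor(sample(1:3, 100, replace = TRUE)),
                 matrix(rnorm(100 * 4), nrow = 100))

# 80/20 split: sample 80% of row indices for training, keep the rest for testing.
train_idx <- sample(nrow(df), size = round(0.8 * nrow(df)))
train <- df[train_idx, ]
test  <- df[-train_idx, ]
```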

Balance the data

The data is unbalanced, so we use sub-sampling to balance it.
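Sub-sampling here means down-sampling every class to the size of the smallest one (327, as the table below shows). A base-R sketch of that step:

```r
set.seed(1)

# Class sizes matching the output below: 450, 327, 401.
category <- factor(rep(1:3, times = c(450, 327, 401)))

# Down-sample each class to the size of the smallest class.
n_min <- min(table(category))
keep  <- unlist(lapply(levels(category), function(cl)
  sample(which(category == cl), n_min)))

table(category[keep])   # all classes now have n_min observations
```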

## 
##   1   2   3 
## 450 327 401
## 
##   1   2   3 
## 327 327 327

We now use a random forest to predict the category from the LSA on TF. The resulting accuracy is inspected on the test set.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3
##          1 82  4  6
##          2  9 71  9
##          3 21  6 85
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8123          
##                  95% CI : (0.7628, 0.8553)
##     No Information Rate : 0.3823          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.718           
##                                           
##  Mcnemar's Test P-Value : 0.01253         
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3
## Sensitivity            0.7321   0.8765   0.8500
## Specificity            0.9448   0.9151   0.8601
## Pos Pred Value         0.8913   0.7978   0.7589
## Neg Pred Value         0.8507   0.9510   0.9171
## Prevalence             0.3823   0.2765   0.3413
## Detection Rate         0.2799   0.2423   0.2901
## Detection Prevalence   0.3140   0.3038   0.3823
## Balanced Accuracy      0.8384   0.8958   0.8551

According to the confusion matrix, the accuracy is 0.8123, and the balanced accuracy is 0.8384 for class 1, 0.8958 for class 2, and 0.8551 for class 3.
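These headline figures can be recovered directly from the printed confusion matrix; a base-R check (entries copied from the output above, rows = predictions, columns = reference):

```r
cm <- matrix(c(82,  4,  6,
                9, 71,  9,
               21,  6, 85), nrow = 3, byrow = TRUE)

accuracy    <- sum(diag(cm)) / sum(cm)          # correct predictions / total
sensitivity <- diag(cm) / colSums(cm)           # recall per reference class
specificity <- sapply(1:3, function(k)
  sum(cm[-k, -k]) / sum(cm[, -k]))              # true negatives / all non-k reference
balanced    <- (sensitivity + specificity) / 2  # balanced accuracy per class

round(c(accuracy = accuracy, balanced), 4)
```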

Supervised learning (LSA on TF-IDF)

Now we repeat the steps using LSA on TF-IDF to obtain a reduced-dimension version of the DFM, and build a random forest model to predict the category from the LSA scores.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3
##          1 90  3  4
##          2  5 73  2
##          3 17  5 94
## 
## Overall Statistics
##                                          
##                Accuracy : 0.8771         
##                  95% CI : (0.834, 0.9124)
##     No Information Rate : 0.3823         
##     P-Value [Acc > NIR] : < 2e-16        
##                                          
##                   Kappa : 0.8146         
##                                          
##  Mcnemar's Test P-Value : 0.02004        
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3
## Sensitivity            0.8036   0.9012   0.9400
## Specificity            0.9613   0.9670   0.8860
## Pos Pred Value         0.9278   0.9125   0.8103
## Neg Pred Value         0.8878   0.9624   0.9661
## Prevalence             0.3823   0.2765   0.3413
## Detection Rate         0.3072   0.2491   0.3208
## Detection Prevalence   0.3311   0.2730   0.3959
## Balanced Accuracy      0.8824   0.9341   0.9130

According to the confusion matrix, the model built on LSA-on-TF-IDF features is better than the model built on LSA-on-TF features. The accuracy is 0.8771, and the balanced accuracy is 0.8824 for class 1, 0.9341 for class 2, and 0.9130 for class 3.

Embedding

Supervised learning (document embedding)

## 
##   1   2   3 
## 450 327 401
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3
##          1 87  6  9
##          2  5 71  7
##          3 20  4 84
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8259          
##                  95% CI : (0.7776, 0.8676)
##     No Information Rate : 0.3823          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7374          
##                                           
##  Mcnemar's Test P-Value : 0.1659          
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3
## Sensitivity            0.7768   0.8765   0.8400
## Specificity            0.9171   0.9434   0.8756
## Pos Pred Value         0.8529   0.8554   0.7778
## Neg Pred Value         0.8691   0.9524   0.9135
## Prevalence             0.3823   0.2765   0.3413
## Detection Rate         0.2969   0.2423   0.2867
## Detection Prevalence   0.3481   0.2833   0.3686
## Balanced Accuracy      0.8470   0.9100   0.8578

Combining LSA (TF-IDF), embedding, likes, and views for supervised learning

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3
##          1 95  3  2
##          2  5 73  2
##          3 12  5 96
## 
## Overall Statistics
##                                          
##                Accuracy : 0.901          
##                  95% CI : (0.861, 0.9327)
##     No Information Rate : 0.3823         
##     P-Value [Acc > NIR] : < 2e-16        
##                                          
##                   Kappa : 0.8506         
##                                          
##  Mcnemar's Test P-Value : 0.03026        
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3
## Sensitivity            0.8482   0.9012   0.9600
## Specificity            0.9724   0.9670   0.9119
## Pos Pred Value         0.9500   0.9125   0.8496
## Neg Pred Value         0.9119   0.9624   0.9778
## Prevalence             0.3823   0.2765   0.3413
## Detection Rate         0.3242   0.2491   0.3276
## Detection Prevalence   0.3413   0.2730   0.3857
## Balanced Accuracy      0.9103   0.9341   0.9360

Further Study and Limitations

One of the limitations comes from the data. Since obtaining a transcript requires clicking into each video to scrape it from the website, it is hard to collect a large amount of data given R's and our laptops' capacity and our time constraints. Furthermore, as mentioned in the data preparation part, the structure of the TED website is not ideal for scraping text, which also created difficulties in terms of data richness and diversity. Besides, the additional features on the TED website, such as information about the speakers and comments from viewers, are relatively fragmented, so there are few additional features valuable enough to use in our analysis.

Since our analysis achieved reasonably good category predictions, as external analysts we could broaden the research using other available information. For example, we could analyze trends in TED talk releases and predict the release schedule, which could help investors understand what kinds of talks TED wants, or is able, to organize in the next six months or year, and what messages it will deliver to its audience. Also, since many speakers have more than one talk on TED, we could analyze variations in the wording of their speeches for the same speaker or the same topic. TED currently offers nearly 400 categories for its videos, which from our perspective is not ideal for users trying to select their favorite topic. TED could therefore improve the classification of its videos by analyzing which categories viewers' comments lean towards.

Conclusion

From the sentiment analysis, we can clearly see that TED tends to post positive talks, which can inspire audiences to face challenges and consider issues from a positive angle. On the other hand, this pattern does not differ between the categories of videos, so to some extent there is a lack of diversity in the videos' sentiment.

References